TCAD Project

By Kelly Chu

GitHub Repository

https://github.com/jameshowison/tcad_capstone.git

The link above is a repository contains all of project's files and each file's revision history. However, the data and final deliverable file is too large that can not be stored in the GitHub, but could be run by this Jupyter notebook under Create and Export the deliverable table.

Introduction

This project, started by a faculty colleague in urban studies, has a data dump from Travis Central Appraisal District (TCAD) on Austin property valuations in 1990 that exported around 2004. Where TCAD is the governmental entity responsible for appraising and assessing the value of all real and personal property in Travis County. Overall, the total size of these files are under 600MB uncompressed. The colleague would like to analyze these databases for his research, so the goal for this project is to gather the information and present the deliverable that would help in the research in the form that the faculty colleague can access it in a format that preferably in an excel sheet for all data. This project is a prototypical combo of archives, databases/data engineering, and some knowledge on geographic information system work.

From the given annotated microfiche, it labeled all the tasks with an example of the following information: Parcel, lots size, use, sq ft , effect date built, deed date, land appraised value, improvement appraised value, owner name, mailing address, location, and Legal description. The final outcome to the faculty colleague will present a table in excel sheet that specifically extracts the data of these columns.

The screenshots of annotated microfiche by the client:

Image 1: Annotated microfiche parcel

Lot%20size.png

EYOC.png

Deed%20date.png

Deed.png

Appr_value%20and%20Impr_value.png

name%20and%20mailing.png

Location.png

Legal%20description.png

TCAD file exploration

Within the folders, it contains 3 types of files:

There are 2 general types of files, one is TCBC and the other is TXBC. From many searches, I assumed that TCBC is the abbreviation for Travis County Basic Code, and TXBC is the Travis County Extra-territory Basic Code. Where in these two general files there are some with suffixes following:

However, by looking at the file description, noticing that the data only provided are TCBC_SUM_1990, TCBC_SUM_1990_JURIS, TCBC_SUM_1990_LEGAL, TXBC_SUM_1990, TXBC_SUM_1990_JURIS, TXBC_SUM_1990_JURIS_EXMP, TXBC_SUM_1990_LEGAL.

Other than the datasets provided by TCAD, I also created the data for the use code where the source is from page 59 of the Reappraisal Plan for Tax Years 2015-2016 by the Texas Comptroller of Public Account (Image 9). I have followed the formatting of .TXT and . TDF files the same way that was given, it is named TXBC_SUM_1990_USECODE. So, it allows me to upload the file into a folder where all files were unzipped and load the data all together in both folders A and B.

Method and Tools

To accomplish the tasks in this project, we have used Google Sheets to save our findings and list out the notes that can be shared within the teams. To unzip and read the files given, we used Python from JupyterHub under the EduPod server and imported the extension called DuckDB for running the SQL code in the Jupyter Notebook.

Before using the tools above we have tried using Snowflakes and DBTCloud, using these two platforms we have difficulties to load the data that were given into its system. This may caused by lack of understanding of using these two platforms. We also tried using the Google Colab as the coding platform, but to be more accessible on collaboration work we use Jupyter notebook instead and practice using Git and load all documents onto GitHub repositoy that allow us to view or call back any edit history.

Shorten files for browsing

To shorten the files for browsing we can run a short shell script. This opens the zip that was received, and truncates each file at 100 lines long.

```{bash, eval=F}

rm -rf shortened_appraisal_files

unzip original_data/Appraisal_Roll_History_1990.zip -d shortened_appraisal_files find shortened_appraisal_files -name ".TXT" -exec sed -i.full 100q {} \; find shortened_appraisal_files -name ".TXT.full" -exec rm {} \; zip -r shortened_appraisal_files.zip shortened_appraisal_files ```

We can now attempt to load a shortened file using pandas

Manually add the UseCode files into the zipped folder.

Challenge now is to use the *.TDF files to create tables. I can think of two approaches.

  1. The TDF files are SQL, so if those are fed to duckdb they should be able to create tables into which the TXT pipe-separated CSV files can be read. There may be issues with the datatypes not matching (which would require mapping the current datatype definitions to duckdb datatypes by changing the words used to give the datatype to the columns).

  2. Take the column names out of the TDF files and add them as the column names while reading the relevant CSV files into duckdb. This would use duckdb's auto understanding of the column datatypes (so it would run, but it might guess wrongly and truncate or change data).

Creating tables using the TDF files

We have TDF files scattered through the _A and _B folders. I have created a schema (a namespace) for the files from _A called "folder_A" and "folder_B". So there are tables named the same thing in each of the schemas. You can reference the tables as folder_A.TCBC_SUM_1990_JURIS and folder_B.TCBC_SUM_1990_JURIS

We can use python to read each TDF file separately, create the table and then try to load the matching TXT file. A little guidance on how to process a directory structure of files using Path and glob here: http://howisonlab.github.io/datawrangling/faq.html#get-data-from-filenames

Create the tables for dbdocs

By using dbdiagram we are able to generate an Entitiy Relation Diagram (ER Diagram). When using this platform, its require to have space in front of the content using the oucome created above. In this project, I've created an ER diagram focusing on the TXBC files is having data.

https://dbdiagram.io/d/64440b406b3194705105fb93

er_diagram.png

Data / Files exploration

Interestingly, these two folders (Appraisal_Roll_History_1990_A and Appraisal_Roll_History_1990_B) contain the exact same files and in all files it has exactly the same information. This can be shown by running a python code that compare the folder:

SQL for analysis

Hey, duckdb implements all the same information schema names as postgres, so one can use the same queries to find the tables with their schaema names.

The table without a suffix (TCBC_SUM_1990) has only 28,086 rows. Perhaps these are accounts for individual tax payers, but individual tax payers can have multiple account number.

Skip down to middle data of to look into detail information of the TCBC summary file.

The table without a suffix (TXBC_SUM_1990) has 255,593 rows. Perhaps these are accounts for individual tax payers, but individual tax payers can have multiple parcel.

The unique key is by Parcel and OwnrId

TCBC_SUM_1990_JURIS - Suppose total of 134933 rows, rows are adding up everytime rerun (fixed now)

JURIS probably means "jurisdiction" which means a legal area. This makes sense because the columns are about tax rates (and metadata about tax status, like 'freeport'). So possibly this file is a list of jurisdictions to which a parcel can belong (and therefore holds the rates that would apply to the parcel?). It is surprising to have 134,933 different jurisdictions though!

Looking for possible column that have relation to the location, then select possible needed information.

Searching for columns that includes 'arcel' for parcel number:

In the TCBC files, the only columns that relate with parcel is the LinkParcel.

Mainly the parcel is located throughout all TXBC type files.

There is no parcel relation in TCBC files knowing the only possible outcome is None from the LinkParcel column. Therefore, the only parcel number is under TXBC files.

Searching for the columns that can possibly find the data of "sq ft":

Both TCBC and TXBC files with no suffix (_SUM_1990) have the column "TotSqft" and may be the data we are searching for.

Looking at all the possible values, TCBC file only have value of "0", and TXBC file have 8,866 types of outcomes. Assuming the record that does not have the square feet are all recorded as "0".

Searching columns that have relation with the use:

TCBC_SUM_1990_SUSP - UseInfoAddrFlag

TXBC_SUM_1990 - AgUseCode, AgUseMulti, UseCode, UseMulti, UseClass

TXBC_SUM_1990_SUSP - UseInfoAddrFlag

There's no data in all files with suffix of SUSP.

The important information that is seeking for use is the UseCode in TXBC_SUM_1990 file. Where the use code is a two digit number and it might just a code that match with other information.

The use code that founded and entered as the same format of other data.

EYOC is effective year of construction, which could be the effective date built. But looking at the table below, there are some confusion or uncertainties.

  1. There is 'None' and '0000' as the possible outcome, what would be the differences?
  2. One of the record is 2971 and another one is 0984, which this didn't make sense. Is there possible typo in this record?

Looking for all deed information

Land appraised values and Improve appraised values

Owner name in TCBC file and TXBC file

Mailing address from TXBC files

Concentrate looking for the location information that is important in the research. Where the FmtLoc present the full address of the record and the other columns (LocStreet, LocHouse, LocFrac, LocAlpha, LocUnit, and LocZip) are the splited address information. This applies to both TCBC and TXBC records.

While below shows the sample table from folder_A_TCBC schema of TCBC_SUM_1990 file.

Legal description from TXBC_SUM_1990_LEGAL

Thinking if area would possible be the lot size

Create and Export the deliverable table

The preview of top 5 and last 5 deliverables

The preview of 150000 to 150005 and 150095 to 150100 deliverables

Exporting the deliverables into csv files, but there's some formatting issue

Save the deliverable table into dataframe and exporting to Excel (.xlsl)

Findings

The deliverable from the annotated microfiche are mostly found in the TXBC (Travis County Extra-territory Basic Code) files. However, the client is not looking for extra-territory records. This makes some confusion about whether or not the TXBC represents the Travis County Extra-territory Basic Code. If not, what is the difference between TXBC and TCBC?

Parcel Number with OwnerId

Looking at the parcel number on the microfiche (Image 1), the sample is having the format of XX-XXXX-XXXX-XXXX. And from the data in TXBC_SUM_1990, the Parcel column is presented as a 10-digit number, and OwnrId is a 4-digits number. It can be figured out that the parcel number from the microfiche is split up into 'Parcel' and 'OwnrId' columns of TXBC, where the first 10-digits are 'Parcel' and the last 4-digits is 'OwnrId'. Therefore, the unique key for the TXBC files is the combination of 'Parcel' and 'OwnrId'.

Lot size

Unfortunately, we have not yet discovered the accurate data for the lot size. The sample lot size from microfiche (Image 2) is 62.5000X101.5000.

Square feet

Throughout all data, we only found one column that contains "sqft" in the column labels. Therefore, we assume TotSqft from the TXBC_SUM_1990 file is the reasonable data for sq ft in the microfiche. The reason we have to assume the TotSqft as the matching information in the microfiche is that the outcomes are widely spread from 4 to 2614058 other than 0. Although there is also TotSqft in the TCBC_SUM_1990 file, the only outcome is '0', which is not useful at all.

Use code

The use code is a 2-digit number code in the TXBC_SUM_1990 file, it should be matching with other information, but the description of this code was not provided in the datasets given. Luckily, I was able to find a document published for the Reappraisal Plan for Tax Years 2015-2016 by the Texas Comptroller of Public Account. Page 59 of this document (Image 9), recorded the Use Codes with its description that was organized by the type of use, we are taking this information and forming it into tables so we can connect between tables. However, the data contains a pretty decent amount of records '00' and 'None' that couldn't be identified. With the information that was created for the additional text file (TXBC_SUM_1990_USECODE), it could load the use code description and its category into the table and match the code from TXBC_SUM_1990.

Use%20code%20image.png

EYOC (effective year of construction or year built)

The same document by the Texas Comptroller of Public Account describes EYOC, which is the "year built and the effective year of construction" (TCAD 2014), and this information is in the TXBC_SUM_1990 file. This matches with one of our deliverables of effect date built on the microfiche. However, I believe that there's some error with the data that was given to us from the TCAD and that there is a possible outcome with the years 2971 and 0984. And in the data of EYOC, there are also both None and 0000 and they are probably presenting the same meaning that they do not have the data for some records.

Deed information (volume, page, and date)

Any deed information can only be found in the TXBC_SUM_1990 file. The deed volume and page number only have one outcome throughout, for both are all presented with "00000". And the deed date only has one record of "1900-00-00". The reason resulting in this information is possibly due to different systems of collection of appraisal in 1900 and exportation in 2004.

Land appraised value and improvement appraised value

Part of the important deliverable is the land appraised value and the improvement appraised value, and this information can only be found in the TXBC_SUM_1990_JURIS file. Some interesting information is the value of LandMktVal seems to match with LandApprdVal, and ImprMktVal matches with the ImprApprdVal. And the total appraised value (TotApprdVal) is the sum of the land appraised value and improvement appraised value, which can be found in the TXBC_SUM_1990_JURIS file.

Owner name

The owner name (OwnerName) is the basic information that can be found in both TCBC_SUM_1990 and TXBC_SUM_1990. The records are different between these two files, but since the rest of the deliverables are from the TXBC files, it is more efficient to look at the TXBC_SUM_1990 for the records, unless working with the TCBC dataset.

Mailing address

The mailing address has split into a maximum of 5 cells from the TXBC_SUM_1990 file. By viewing the MailCnt before reading the mailing address (MailAddr1, MailAddr2, MailAddr3, MailAddr4, and MailAddr5) it can tell the information of how many cells the mailing address is split into.

Location

Location can also be found in both TCBC_SUM_1990 and TXBC_SUM_1990 datasets. With consistent reasoning, it is more efficient to continue using the TXBC file. There is a column of data that records the entire location, and six other columns are the split record of this appraisal history. If looking at the overall location information for the research, it can use the column called FmtLoc, and if would like to organize and get information about the location individually, then it is more efficient to just read separately for LocStreet, LocHouse, LocFrac, LocAlpha, LocUnit, andLocZip.

Legal descriptions are stored in both TXBC and TCBC files that have the suffix LEGAL (TXBC_SUM_1990_LEGAL and TCBC_SUM_1990_LEGAL). Since the unique keys that are used are the Parcel and OwnrId, therefore we are using the TXBC information instead of TCBC. The pattern of the legal description has not yet been discovered with LegalSub1, LegalSub2, LegalCd1, LegalLn1, LegalCd2, LegalLn2, LegalCd3, LegalLn3, LegalCd4, LegalLn4, LegalCd5, LegalLn5, LegalCd6, and LegalLn6.

Challenges and discussions

At the start of this project, we faced some difficulties in understanding the content of the data given. I took lots of time just browsing on the internet to look for the description of abbreviations that includes in the data, just like TCBC and TXBC means Travis County Basic Code and Travis County Extra-territory Basic Code, this definition is the most reasonable from my searches, but it still has the possibility that is representing something else. Another challenge that we haven't discovered is the data for lot size, this is one the important information for the client's research. To find the solution, we probably need to look into the TCAD website to find if there's any other data matching, this may need to dive in by individual records.

Some further discussion can be:

Reference (MLA)

"Reappraisal Plans for Tax Years 2015 & 2016." Comptroller.texas.gov, TRAVIS CENTRAL APPRAISAL DISTRICT, 19 Aug. 2014, comptroller.texas.gov/taxes/property-tax/reappraisals/. Accessed 14 Apr. 2023.